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ABSTRACT 

This work introduces a runtime model for managing com- 
munication with support for latency-hiding. The model en- 
ables non-computer science researchers to exploit commu- 
nication latency-hiding techniques seamlessly. For compiled 
languages, it is often possible to create efficient schedules for 
communication, but this is not the case for interpreted lan- 
guages. By maintaining data dependencies between sched- 
uled operations, it is possible to aggressively initiate com- 
munication and lazily evaluate tasks to allow maximal time 
for the communication to finish before entering a wait state. 
We implement a heuristic of this model in DistNumPy, an 
auto-parallelizing version of numerical Python that allows 
sequential NumPy programs to run on distributed memory 
architectures. Furthermore, we present performance com- 
parisons for eight benchmarks with and without automatic 
latency-hiding. The results shows that our model reduces 
the time spent on waiting for communication as much as 
27 times, from a maximum of 54% to only 2% of the total 
execution time, in a stencil application. 

1. INTRODUCTION 

There are many ways to categorize scientific applications - 
terms including scalability, communication pattern, IO and 
so forth. In the following, we wish to differentiate between 
large maintained codes, often commercial or belonging to a 
community, and smaller, less organized, codes that are used 
by individual researchers or in a small research group. The 
large codes are often fairly static and each version of the 
code can be expected to be run many times by many users, 
and thus justifying a large investment in writing the code. 
The small development codes on the other hand, change fre- 
quently and may only be run a few times after each change, 
usually only by the one user who made the changes. 

The consequence of these two patterns is that the large 
codes may be written in a compiled language with explicit 
message-passing. While the small codes have an inherent 
need to be written in a high-productivity programming lan- 



guage, where the development time is drastically reduced 
compared to a compiled language with explicit message- 
passing. 

High-productivity languages such as Matlab and Python - 
popular languages for scientific computing - are generally 
accepted as being slower than compiles languages, but more 
importantly they are inherently sequential and while intro- 
ducing parallelism is possible in these languages jT] [2] [3] it 
limits the productivity. It has been previously shown that 
it is possible to parallelize matrix and vector-based data- 
structures from Python, even on distributed memory archi- 
tecturesjj. However, the parallel execution speed is severely 
impeded by communication between nodes, in such a scheme 
for automatic parallelization. 

To obtain performance in manual parallelization the pro- 
grammer usually applies a technique known as latency-hid- 
ing, which is a well-known technique to improve the per- 
formance and scalability of communication bound problems 
and is mandatory in many scientific computations. 

In this paper, we introduce an abstract model to handle 
latency-hiding at runtime. The target is scientific applica- 
tions that make use of vectorized computation. The model 
enables us to implement latency-hiding into high-produc- 
tivity programming languages where the runtime system 
handles communication and parallelization exclusively. 

In such high-productivity languages, a key feature is auto- 
matic distribution, parallelization and communication that 
are transparent to the user. Because of this transparency, 
the runtime system has to handle latency-hiding without 
any help. Furthermore, the runtime system has no knowl- 
edge of the communication pattern used by the user. A 
generic model for latency-hiding is therefore desirable. 

The transparent latency-hiding enables a researcher that 
uses small self-maintained programs, to use a high-produc- 
tivity programming language, Python in our case, with- 
out sacrificing the possibility of utilizing scalable distributed 
memory platforms. The purpose of the work is not that the 
performance of an application, which is written in a high- 
productivity language, should compete with that of a manu- 
ally parallelized compiled application. Rather the purpose is 
to close the gap between high-productivity on a single CPU 
and high performance on a parallel platform and thus have 
a high-productivity environment for scalable architectures. 



The latency-hiding model proposed in this paper is tailored 
to parallel programming languages and libraries with the 
following properties: 

• The programming language requires dynamic schedul- 
ing at runtime because it is interpreted. 

• The programming language supports and utilizes a dis- 
tributed memory environment. 

• All parallel processes have a global knowledge of the 
data distribution and computation. 

• The programming language makes use of data paral- 
lelism in a Single Instruction, Multiple Data (SIMD) 
fashion in the sense that data affinity dictates the dis- 
tribution of the computation. 

Distributed Numerical Python (DistNumPy) is an example 
of such a parallel library, and the first project that fully 
incorporate our latency-hiding model. The implementation 
of DistNumPy is open-source and freely availably. 

The rest of the paper is organized as follows. In section 2, 
3 and 4, we go through the background and theory of our 
latency-hiding model. In section 5, we describe how we use 
our latency-hiding model in DistNumPy. In section 6, we 
present a performance study. Section 7 is future work, and 
finally in section 8 we conclude. 

2. RELATED WORK 

Libraries and programming languages that support paral- 
lelism in a high productive manner is a well-known concept. 
In a perfect framework, all parallelism introduced by the 
framework is completely transparent to the user while the 
performance and scalability achieved is optimal. However, 
most frameworks require the user to specify some kind of 
parallelism - either explicitly by using parallel directives or 
implicitly by using parallel data structures. 

In this paper we will focus on data parallel frameworks, in 
which parallelism is based on the exploitation of data local- 
ity. A classical example of such a framework is High Per- 
formance Fortran (HPF) 5 , which is an extension of the 
Fortran-90 programming language. HPF introduces paral- 
lelism primarily with vector operations, which, in order to 
archive good performance, must be aligned by the user to 
reduce communication. However, a lot of work has been put 
into eliminating this alignment issue either at compile-time 
or run-time [6] [7] [8]. 

DistNumPy[4] is a library for doing numerical computation 
in Python that targets scalable distributed memory archi- 
tectures. DistNumPy accomplishes this by extending the 
NumPy module 9 , which combined with Python forms a 
popular framework for scientific computations. The paral- 
lelization introduced in DistNumPy focuses on parallel vec- 
tor operations like HPF, but because of the latency-hiding 
we introduce in this paper, it is not a requirement to align 
vectors in order to achieve good performance. 

DistNumPy is available at 

http : / /code . google . com/p/DistNumPy 



Hardware architectures also exploit data parallelism to hid- 
ing memory latency [10] or communication latency |TT] . Like- 
wise, parallel data dependency analysis is essential in order 
to efficiently schedule instructions and avoid pipeline inter- 
locks pi] 

3. LATENCY-HIDING 

We define latency-hiding informally as in [14] - "a technique 
to increase processor utilization by transferring data via the 
network while continuing with the computation at the same 
time". When implementing latency-hiding the overall per- 
formance depends on two issues: the ability of the commu- 
nication layer to handle the communication asynchronously 
and the amount of computation that can overlap the com- 
munication - in this work we will focus on the latter issue. 

In order to maximize the amount of communication hid- 
den behind computation when performing vectorized com- 
putations our abstract latency-hiding model uses a greedy 
algorithm. The algorithm divides the arrays, and thereby 
the computation, into a number of fixed-sized data blocks. 
Since most numerical applications will work on identical di- 
mensioned datasets, the distribution of the datasets will be 
identical. For many data blocks, the location will therefore 
be the same and these will be ready for execution without 
any data transfer. While the co-located data blocks are pro- 
cessed, the transfers of the data blocks from different loca- 
tion can be carried out in the background thus implementing 
latency-hiding. The performance of this algorithm relies on 
two properties: 

• The number of data blocks must be significantly great- 
er than number of parallel processors. 

• A significat number of data blocks must share location. 

In order to obtain both properties we need a data structure 
that support easy retrieval of dependencies between data 
blocks. Furthermore, the number of data blocks in a com- 
putation is proportional with the total problem size thus 
efficiency is of utmost importance. 

4. DIRECTED ACYCLIC GRAPH 

It is well-known that a directed acyclic graph (DAG) can 
be used to express dependencies in parallel applications 15 . 
Nodes in the DAG represent operations and edges represent 
serialization dependencies between the operations, which in 
our case is due to conflicting data block accesses. 

Scheduling operations in a DAG is a well-studied problem. 
The scheduling problem is NP-complete in its general forms 
[16| where operations are scheduled such that the overall 
computation time is minimized. There exist many heuristic 
for solving the scheduling problem [17], but none match our 
target. 

The scheduling problem we solve in this paper is not NP- 
hard because we are targeting programming frameworks that 
make use of data parallelism in a SIMD fashion. The par- 
allel model we introduce is statically orchestrating data dis- 
tribution and parallelization based on predefined data affin- 
ity. Assignment of computation tasks are not part of our 



scheduling problem. Instead, our scheduling problem con- 
sists of maximizing the amount of communication that over- 
laps computation when moving data to the process that is 
predefined to perform the computation. 

In [18] the authors demonstrate that it is possible to dynamic 
schedule operations in a distributed environment using local 
DAGs. That is, each process runs a private runtime sys- 
tem and communicates with other processes regarding data 
dependences. Similarly, our scheduling problem is also dy- 
namic but in our case all processes have a global knowledge 
of the data distribution and computation. Hence, no com- 
munication regarding data dependences is required at all. 

The time complexity of insetting a node into a DAG, G = 
(V,E), is 0(V) in worse case. Building the complete DAG 
is therefore 0(V 2 ). Removing one node from the DAG is 
0(V), which means that in the case where we simply wants 
to schedule all operations in a legal order the time complex- 
ity is 0(V 2 ). This is without minimizing the overall com- 
putation or the amount of communication hidden behind 
computation. We therefore conclude that a complete DAG 
approach is inadequate for runtime control of latency-hiding 
in our case. 

We address the shortcoming of the DAG approach through 
a heuristic that manage dependencies on individual blocks. 
Instead of having a complete DAG, we maintain a list of 
depending operations for each data block. Still, the time 
complexity of scheduling all operations is 0(V 2 ) in worse 
case, but the heuristic exploits the observation that in the 
common case a scientific application spreads a vectorized 
operation evenly between the data blocks in the involved 
arrays. Thus the number of dependencies associated with a 
single data block is manageable by a simple linked list. In 
Section 15.71 we will present a practical implementation of 
the idea. 

5. DISTRIBUTED NUMERICAL PYTHON 

The programming language Python combined with the nu- 
merical library NumPy[5] has become a popular numerical 
framework amongst researchers. It offers a high level pro- 
gramming language to implement new algorithms that sup- 
port a broad range of high level operations directly on vec- 
tors and matrices. 

The idea in NumPy is to provide a numerical extension to 
the Python language that enables the Python language to be 
both high productive and high performing. NumPy provides 
not only an API to standardized numerical solvers, but also 
the option to develop new numerical solvers that are both 
implemented and efficiently executed in Python, much like 
the idea behind the Matlab[TS] framework. 

DistNumPy is a new version of NumPy that parallelizes ar- 
ray operations in a manner completely transparent to the 
user - from the perspective of the user, the difference be- 
tween NumPy and DistNumPy is minimal. DistNumPy can 
use multiple processors through the communication library 
Message Passing Interface (MPI) [20]. However, DistNumPy 
does not use the traditional single-program multiple-data 
(SPMD) parallel programming model that requires the user 
to differentiate between the MPI-processes. Instead, the 



MPI communication in DistNumPy is fully invisible and the 
user needs no knowledge of MPI or any parallel program- 
ming model. The only difference in the API of NumPy and 
DistNumPy is the array creation routines. DistNumPy al- 
lows both distributed and non-distributed arrays to co-exist, 
the user must specify, as an optional parameter, if the ar- 
ray should be distributed. The following illustrates the only 
difference between the creation of a standard array and a 
distributed array: 

#Non-Distributed 

A = numpy . array ( [1 , 2 , 3] ) 

#Dis tributed 

B = numpy . arr ay ( [1 , 2 , 3] , dist=True) 

5.1 Views 

NumPy and DistNumPy use identical arrays syntax, which 
is based on the Python list syntax. The arrays are indexed 
positionally, through length - 1, where negative indexes is 
used for indexing in the reversed order. Like the list syntax 
in Python, it is possible to index multiple elements. All in- 
dexing that represents more than one element returns a view 
of the elements rather than a new copy of the elements. This 
means that an array does not necessarily represent a com- 
plete, contiguous block of memory. It is possible to have a 
hierarchy of arrays where only one array represents a com- 
plete contiguous block of memory and the other arrays rep- 
resent a subpart of that memory. DistNumPy implements 
an array hierarchy where distributed arrays are represented 
by the following two data structures. 

Array-base is the base of an array and has direct access to 
the content of the array in main memory. An array- 
base is created with all related meta-data when the 
user allocates a new distributed array, but the user 
will never access the array directly through the array- 
base. The array-base always describes the whole array 
and its meta-data such as array size and data type. 

Array-view is a view of an array-base. The view can re- 
present the whole array-base or only a sub-part of the 
array-base. An array- view can even represent a non- 
contiguous sub-part of the array-base. An array-view 
contains its own meta-data that describes which part 
of the array-base is visible. The array-view is mani- 
pulated directly by the user and from the users per- 
spective the array-view is simply a normal contiguous 
array. 

Array-views are not allowed to refer to each other, which 
means that the hierarchy is flat with only two levels: array- 
base below array-view. However, multiple array-views are 
allowed to refer to the same array-base. This two-tier hier- 
archy is illustrated in Figure [1] 

5.2 Data Layout 

The data layout in DistNumPy consists of three kinds of 
data blocks: base-blocks, view-blocks and sub-view-blocks, 
which make up a three level abstraction hierarchy (Fig. |2J. 

Base-block is a block of an array-base. It maps directly 
into one block of memory located on one node. The 
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Figure 1: Reference hierarchy between the two array 
data structures and the main memory. Only the 
three array-views at top of the hierarchy are visible 
from the perspective of the user. 

memory block is not necessarily contiguous but only 
one MPI-process has exclusive access to the block. 
Furthermore, DistNumPy makes use of a N-Dimen- 
sional Block Cyclic Distribution inspired by High Per- 
formance Fortran 51, in which base- blocks are distributed 
across multiple MPI-processes in a round-robin fash- 
ion. 

View-block is a block of an array-view, from the perspec- 
tive of the user a view-block is a contiguous block of 
array elements. A view-block can span over multiple 
base-blocks and consequently also over multiple MPI- 
processes. For a MPFprocess to access a whole view- 
block it will have to fetch data from possibly remote 
MPFprocesses and put the pieces together before ac- 
cessing the block. To avoid this process, which may 
cause some internal memory copying, we divide view- 
blocks into sub- view-block. 

Sub-view-block is a block of data that is a part of a view- 
block but is located on only one MPFprocess. The 
driving idea is that all array operation is translated 
into a number of sub-view-block operations. 

We will define an aligned array as an array that have a direct, 
contiguous mapping through the block hierarchy. That is, a 
distributed array in which the base-blocks, view-blocks and 
sub-view-blocks are identical. A non-aligned array is then a 
distributed array without this property. 

5.3 Universal Function 

An important mechanism in DistNumPy is a concept called 
a Universal function. A universal function (ufunc) is a func- 
tion that operates on all elements in an array-view inde- 
pendently. That is, an ufunc is a vectorized wrapper for a 
function that takes a fixed number of scalar inputs and pro- 
duces a fixed number of scalar outputs. E.g., addition is an 
ufunc that takes three array-views as argument: two input 
arrays and one output array. For each element, the ufunc 
adds the two input arrays together and writes the result 
into the output array. Using ufunc can result in a signifi- 
cant performance boost compared to native Python because 




Figure 2: An illustration of the block hierarchy that 
represents a 2D distributed array. The array is di- 
vided into three block-types: Base, View and Sub- 
View-blocks. The 16 base-blocks make up the base- 
array, which may be distributed between multiple 
MPI-processes. The 9 view-blocks make up a view 
of the base-array and represent the elements that 
are visible to the user. Each view-block is further- 
more divided into four sub-view-blocks, each located 
on a single MPI-process. 

the computation-loop is implemented in C and is executed 
in parallel. 

Applying an ufunc operation on a whole array-view is se- 
mantically equivalent to performing the ufunc operation on 
each array-view block individually. This property makes it 
possible to perform a distributed ufunc operation in parallel. 
A distributed ufunc operation consists of four steps: 

1. All MPFprocesses determine the distribution of the 
view-block computation, which is strictly based on the 
distribution of the output array-view. 

2. All MPI-processes exchange array elements in such a 
mariner that each MPI-process can perform its com- 
putation locally. 

3. All MPI-processes perform their local computation. 

4. All MPI-processes send the altered array elements back 
to the original locations. 

5.4 Latency-Hiding 

The standard approach to hide communication latency be- 
hind computation in message-passing is a technique known 
as double buffering. The implementation of double buffer- 
ing is straightforward when operating on a set of data block 
that all have identical sizes. The communication of one data 
block is overlapped with the computation of another already 
communicated data block and since the sizes of all the data 
blocks are identical all iterations are identical. 

In DistNumPy, a straightforward double buffering approach 
works well for ufuncs that operate on aligned arrays, be- 
cause it translates into communication and computation op- 
erations on whole view-blocks, which does not benefit from 
latency-hiding inside view-blocks. However, for ufuncs that 
operate on non-aligned arrays this is not the case because the 
view-block is distributed between multiple MPI-processes. 
In order to achieve good scalable performance the Dist- 
NumPy implementation must therefore introduce latency- 
hiding inside view-blocks. For example the computation 



import numpy 

M = numpy . array ( [1 , 2 , 3 ,4 , 5 , 6] , \ 

dist = True ) 
N = numpy . empty (( 6) , dist = True ) 
A = M [2 : ] 
B = M [0 : 4] 
C = N [1 : 5] 
C = A + B 



Figure 3: This is an example of a small 3-point sten- 
cil application. 
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Figure 4: The data layout of the two arrays M and 
N and the three array-views A, B and C in the 3- 
point stencil application (Fig. [3]). The arrays are 
distributed between two nodes using a block-size of 
three. 

of a view-block in Figure [2] can make use of latency-hiding 
by first initiating the communication of the three non-local 
sub-view-blocks then compute the local sub- view-block and 
finally compute the three communicated sub- view-blocks. 

5.5 The Dependency Graph 

One of the key contributions in this paper is a latency- 
hiding model that, by maintaining data dependencies be- 
tween scheduled operations, is able to aggressively initiate 
communication and lazily evaluate tasks, in order to allow 
maximal time for the communication to finish before enter- 
ing a wait state. In this section, we will demonstrate the 
idea of the model by giving an example of a small 3-point 
stencil computation implemented in DistNumPy (Fig. [3J). 
For now, we will use a traditional DAG for handling the 
data dependencies. Later we will describe the implementa- 
tion of the heuristic that enables us to manage dependencies 
more efficiently. 

Additional, it should be noted that the parallel processes do 
not need to exchange dependency information since they all 
have full knowledge of the data distribution. 

Two processes are executing the stencil application and Dist- 
NumPy distributes the two arrays, M and N, using a block- 
size of three. This means that three contiguous array ele- 
ments are located on each process (Fig. [4j. Using a DAG as 
defined in section[4] figure [5] illustrates the dependencies be- 
tween 12 operations that together constitute the execution. 
Initially the following six operations are ready: 

R := {opl, op2, op3, op4, op9, oplO} 

Afterwards, without the need of communication, two more 
operations op5 and op8 may be executed. Thus, it is possible 
to introduce latency-hiding by initiating the communication, 
op6 and opl, before evaluating operation op5 and op8. The 
amount of communication latency hidden depends on the 



Figure 5: Illustration of a DAG that represents the 
dependencies in a 3-point stencil application (Fig. 
[3|). The DAG consists of 12 operations, opl to opl 2, 
divided between two processes. 

computation time of op5 and op8 and the communication 
time of op6 and opl . 

We will strictly prioritize between operations based on whe- 
ther they involve communication or computation - giving 
priority to communication over computation. Furthermore, 
we will assume that all operations take the same amount of 
time, which is a reasonable assumption in DistNumPy since 
it divides array operations into small blocks that often have 
the same computation or communication time. 

5.6 Lazy Evaluation 

Since Python is an interpreted dynamic programming lan- 
guage, it is not possible to schedule communication and com- 
putation operations at compile time. Instead, we introduce 
lazy evaluation as a technique to determine the communi- 
cation and computation operations used in the program at 
runtime. 

During the execution of a DistNumPy program all MPI- 
processes record the requested array operations rather than 
applying them immediately. The MPI-processes maintain 
the operations in a convenient data structure and at a later 
point all MPI-processes apply the operations. The idea is 
that by having a set of operations to carry out it may be 
possible to schedule communication and computation oper- 
ations that have no mutual dependencies in parallel. 

We will only introduce lazy evaluation for Python operations 
that involve distributed arrays. If the Python interpreter 
encounters operations that do not include DistNumPy ar- 
rays, the interpreter will execute them immediately. At some 
point, the Python interpreter will trigger DistNumPy to ex- 
ecute all previously recorded operation. This mechanism of 
executing all recorded operation we will call an operation 
flush and the following three conditions may trigger it. 

• The Python interpreter issues a read from distribut- 
ed data. E.g. when the interpreter reaches a branch 
statement. 

• The number of delayed operations reaches a user-de- 
fined threshold. 

• The Python interpreter reaches the end of the pro- 
gram. 



5.7 The Dependency System 

The main challenge when introducing lazy evaluation is to 
implement a dependency system that schedules operations 
in a performance efficient manner while the implementation 
keeps the overhead at an acceptable level. 

Our first lazy evaluation approach makes use of a DAG-bas- 
ed data structure to contain all recorded operations. When 
an operation is recorded, it is split across the sub-view- 
blocks that are involved in the operation. For each such 
operation, a DAG node is created just as in figure [3] and [4] 

Beside the DAG our dependency system also consist of a 
ready queue, which is a queue of recorded operations that 
do not have any dependencies. The ready queue makes it 
possible to find operations that are ready to be executed in 
the time complexity of O(l). 

Operation Insertion. The recording of an operation trig- 
gers an insertion of new node into the DAG. A straightfor- 
ward approach will simply implement insertion by compar- 
ing the new node with all the nodes already located in the 
DAG. If a dependency is detected the implementation adds 
an edge between the nodes. The time complexity of such an 
implementation is 0(n) where n is the number of operation 
in the DAG and the construction of the complete DAG is 
0(n 2 ). 



Operation Flush. To achieve good performance the oper- 
ation flush implementation must maximize the amount of 
communication that it is overlapped by computation. There- 
fore, the flush implementation initiate communication at the 
earliest point in time and only do computation when all com- 
munication has been initiated. Furthermore, to make sure 
that there is progress in the MPI layer it checks for fin- 
ished communication in between multiple computation op- 
erations. The following is the flow of our operation flush 
algorithm: 

1. Initiate all communication operations in the ready qu- 
eue. 

2. Check in a non-blocking manner if some communica- 
tion operations have finished and remove finished com- 
munication operations from the ready queue and the 
DAG. Furthermore, register operations that now have 
no dependencies into the ready queue. 

3. If there is only computation operations in the ready 
queue execute one of them and remove it from the 
ready queue and the DAG. 

4. Go back to step one if there are any operations left in 
the ready queue else we are finished. 

The algorithm maintains the following three invariants: 

1. All operations, that are ready, are located in the ready 
queue. 
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Figure 6: Illustration of the naive evaluation ap- 
proach. The result is a deadlock in the first itera- 
tion since both processes are waiting for the receive- 
node to finish, but that will never happen because 
the matching send-node is in second iteration. 

2. We only start the execution of a computation node 
when there is no communication node in the ready 
queue. 

3. We only wait for communication when the ready queue 
has no computation nodes. 

5.7.1 Deadlocks 

To avoid deadlocks a MPI-process will only enter a blocking 
state when it has initiated all communication and finished all 
ready computation. This guaranties a deadlock-free execu- 
tion but it also reduces the flexibility of the execution order. 
Still, it is possible to check for finished communication using 
non-blocking functions such as MPI_Testsome () . 

The naive approach to evaluate a DAG is simply to first eval- 
uate all nodes that have no dependencies and then remove 
the evaluated nodes from the graph and start over - similar 
to the traditional BSP model. However, this approach may 
result in a deadlock as illustrated in figure [6] 

5. 7. 2 Dependency Heuristic 

Experiments with lazy evaluation using the DAG-based data 
structure shows that the overhead associated with the cre- 
ation of the DAG is very time consuming and becomes the 
dominating performance factor. We therefore introduce a 
heuristic to speed up the common case. We base the heuris- 
tic on the following two observations: 

• In the common case, a scientific DistNumPy applica- 
tion spreads a computation evenly between all sub- 
view-blocks in the involved arrays. 

• Operation dependencies are only possible between sub- 
view-blocks that are part of the same base-block. 

The heuristic is that instead of having a DAG, we introduce 
a prioritized operation list for each base-block. The assump- 



Table 1: Hardware specifications 



CPU 


Intel Xeon E5345 


CPU Frequency 


2.33 GHz 


CPU per node 


2 


Cores per CPU 


4 


Memory per node 


16 GB 


Number of nodes 


16 


Network 


Gigabit Ethernet 



tion is that, in the common case, the number of operations 
associated with a base-block is manageable by a linked list. 

We implement the heuristic using the following algorithm. 
A number of operation-nodes and access-nodes represent the 
operation dependencies. The operation-node contains all in- 
formation needed to execute the operation on a set of sub- 
view-blocks and there is a pointer to an access-node for each 
sub-view-block. The access-node represents memory access 
to a sub-view-block, which can be either reading or writ- 
ing. E.g., the representation of an addition operation on 
three sub- view-blocks is two read access-nodes and one write 
access-node (Fig. [7]). 

Our algorithm places all access-nodes in dependency-lists 
based on the base-block that they are accessing. When an 
operation-node is recorded each associated access-node is in- 
serted into the dependency list of the sub- view-blocks they 
access. Additionally, the number of accumulated dependen- 
cies the access-nodes encounter is saved as the operation- 
node's reference counter. 
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system. 



All access-nodes that access the same 
base-block are linked together in a depen- 
dency-list. The order of the list is simply 
based on the time of node insertion (de- 
scending order). Additionally an access- 
node contains the information needed to 
determine whether it depends on other 
access-nodes. 

An operation-node has a pointer to all 
access-nodes that are involved in the op- 
eration. Attached to an operation is a 
reference counter that specifies the num- 
ber of dependencies associated with the 
operation. When the counter reaches 
zero the operation is ready for execution. 
At some point when the operation has 
been executed the operation-node and all 
access-nodes are completely remove from 
the dependency system. 

A base-block simply contains a pointer to 
the first access-node in the dependency- 
list. 



The structures used in the dependency 



All operation-nodes that are ready for execution have a refer- 
ence count of zero and are in the ready queue. Still, they may 
have references to access-nodes in dependency-lists - only 
when we execute an operation-node will we remove the as- 
sociated access-nodes from the dependency-lists. Following 
the removal of an access-node we traverse the dependency- 
list and for each depending access-node we reduce the asso- 
ciating reference counter by one. Because of this, the refer- 
ence counter of another operation-node may be reduced to 
zero, in which case we move the operation-node to the ready 
queue and the algorithm starts all over. 

Figure \7\ goes through all the structures that make up the 
dependency system and figure [8] illustrates a snapshot of 
the dependency system when executing the 3-point stencil 
application. 

6. EXPERIMENTS 

To evaluate the performance impact of the latency-hiding 
model introduced in this paper, we will conduct performance 
benchmark using DistNumPy and NumPj0. The bench- 
mark is executed on an Intel Core 2 Quad cluster (Table [TJ 
and for each application we calculate the speedup of Dist- 
NumPy compared to NumPy. The problem size is constant 
though all the executions, i.e. we are measuring strong scal- 
ing. To measure the performance impact of the latency- 
hiding, we use two different setups: one with latency-hiding 
and one that uses blocking communication. For both setups 
we measured the time spent on waiting for communication, 
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Figure 8: Illustration of the dependency system 
when executing the 3-point stencil in figure 131 HI and 
O The illustration is a snapshot of the dependency 
system on node after the creation of all the arrays. 
Note that since the block size is three, node only 
has one block of each array. 
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# Black Scholes Function 

# S : Stock price 

# X : Strike price 

# T: Years to maturity 

# r: Risk- free rate 

# v : Volatility 

def BlackScholes (CallPutFlag , S , X , T , r , 
v) : 

dl = ( log (S/X) +(r+v*v/2 . ) *T) / ( v* sqrt 
(T)) 

d2 = dl-v*sqrt (T) 

if CallPutFlag== ' c ' : 

return S*CND (dl ) -X*exp (-r*T) *CND ( 
d2) 

else : 

return X*exp (-r*T) *CND (-d2) -S*CND 
(-dl) 



i.e. the communication latency not hidden behind compu- 
tation. 

In this benchmark we utilizes the cluster in a by node fash- 
ion. That is, from one to sixteen CPU-cores we start on 
MPI-process per node (no SMP) and above sixteen CPU- 
cores we start multiple MPI-processes per node. The MPI 
library used throughout this benchmark is OpenMPQ 

The benchmark consists of the following eight Python ap- 
plications. 

Fractal Computation of the Mandelbrot Set. From a Num- 
Py tutorial written by Walt [21] (Fig. [TTj) . 

Black-Scholes Computation of the Black-Scholes model [2^] 
implemented in NumPy (Fig. [9l and [12)) . 

Both Fractal and Black-Scholes are embarrassingly 
parallel applications and we expect that latency-hiding 
will not improve the performance. 

N-body A Newtonian N-body simulation that uses a naive 
algorithm that computes all body-body interactions. 
The NumPy implementation is a translation of a MAT- 
LAB application by Casanova 23 (Fig. 1 13fl . 

kNN A naive implementation of a k nearest neighbor search 
(Fig. ED . 

The N-body and kNN applications have a computa- 
tion complexity of 0(n 2 ). This indicates that the two 
applications should have good scalability even without 
latency-hiding. 

Lattice Boltzmann 2D Lattice Boltzmann model of chan- 
nel flow in 2D using the D2Q9 model. It is a translation 
of a MATLAB application by Latt24 (Fig. [To)) . 

Lattice Boltzmann 3D Lattice Boltzmann model of a flu- 
id in 3D using the D3Q19 model. It is a translation of 
a MATLAB application by Haslam[25] (Fig. [TO)) . 

The two Lattice Boltzmann applications have a com- 
putation complexity of 0(n). More communication 
is therefore needed and we expect that latency-hiding 
will improve the performance. 

Jacobi The Jacobi method is an algorithm for determin- 
ing the solutions of a system of linear equations. It 
is an iterative method that uses a spitting scheme to 
approximate the result (Fig. [TTJ). 

Jacobi Stencil In this benchmark, we have implemented 
the Jacobi method using stencil operations rather than 
matrix row operations (Fig. H0l and[T8 )l . 

The two Jacobi applications also have a computation 
complexity of 0(n). However, the constant associated 
with n is very small, e.g. to compute one element in 
the Jacob Stencil application four adjacent elements 
are required. We expect latency-hiding to be very im- 
portant for good scalability. 



Figure 9: This is the Black Sholes Function in 
the Black-Scholes benchmark where CND is the cu- 
mulative normal distribution. Note that there is 
no source code difference between a parallel and 
a sequential version — it is regular Python/Numpy 
source code. 
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Figure 10: This is the kernel in the Jacobi Stencil 
benchmark. First, we define a view of the full ar- 
ray for each point in the stencil, five in this case, 
and then we apply the stencil until it converges. 
Note that there is no source code difference between 
a parallel and a sequential version — it is regular 
Python/Numpy source code. 
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Figure 11: Speedup of the Fractal application. 
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Figure 12: Speedup of the Black-Scholes application. 
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Figure 15: Speedup of the Lattice Boltzmann 2D 
application. 
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Figure 13: Speedup of the N-body application. 



Figure 16: Speedup of the Lattice Boltzmann 3D 
application. 
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Figure 14: Speedup of the kNN application. 



Figure 17: Speedup of the Jacobi application. 
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Figure 18: Speedup of the Jacobi Stencil applica- 
tion. 



6.1 Discussion 

Overall, the benchmarks show that DistNumPy has both 
good performance and scalability. However, the scalability 
is somewhat worsening at 32 CPU-cores and above, which 
correlates with the use of multiple CPU-cores per node. Be- 
cause of this distinct performance profile, we separate the 
following discussion into results executed on one to sixteen 
CPU-cores (one CPU-core per node) and the results exe- 
cuted on 32 CPU-cores to 128 CPU-cores (multiple CPU- 
cores per node). 

6.1.1 One to Sixteen CPU-cores 

The benchmarks clearly shows that DistNumPy has both 
good performance and scalability. Actually, half of the Py- 
thon applications achieve super-linear speedup at sixteen 
CPU-cores. This is possible because DistNumPy, opposed 
to NumPy, will try to reuse memory allocations by lazily de- 
allocating arrays. DistNumPy uses a very naive algorithm 
that simply checks if a new array allocation is identical to 
a just de-allocated array. If that is the case one memory 
allocation and de-allocation is avoided. 

In the two embarrassingly parallel applications, Fractal and 
Black-Scholes, we see very good speedup as expected. Since 
the use of communication in both applications is almost 
non-existing the latency-hiding makes no difference. The 
speedup achieved at sixteen CPU-cores is 18.8 and 15.4, re- 
spectively. 

The two naive implementations of N-body and kNN do 
not benefit from latency-hiding. In N-body the dominat- 
ing operations are matrix-multiplications, which is a native 
operation in NymPy and in DistNumPy implemented as spe- 
cialized operations using the parallel algorithm SUMMA|26| 
and not as a combination of ufunc operations. Since both 
the latency-hiding and the blocking execution use the same 
SUMMA algorithm the performance is very similar. How- 
ever, because of the overhead associated with latency-hiding, 
the blocking execution performs a bit better. The speedup 
achieved at sixteen CPU-cores is 17.2 with latency-hiding 
and 17.8 with blocking execution. Similarly, the perfor- 
mance difference between latency-hiding and blocking in 
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Figure 19: Speedup of the N-body application that 
compares by node, in which the maximum number 
of CPU-cores is used, and by core, in which the min- 
imum number of nodes is used. Note that the by 
node graph is identical to the latency-hiding graph 
in figure 1131 



kNN is minimal - the speedup achieved at sixteen CPU- 
cores is 12.5 and 12.6, respectively. The relatively modest 
speedup in kNN is the result of poor load balancing. At 
eight and sixteen CPU-cores the chosen problem is not di- 
vided evenly between the processes. 

Latency-hiding is somewhat beneficial to the two Lattice 
Boltzmann applications. The waiting time percentage on 
sixteen CPU-cores goes from 19% to 13% in Lattice Boltz- 
mann 2D, and from 16% to 9% in Lattice Boltzmann 
3D. However, the overall impact on the speedup is not that 
great, primarily because of the overhead associated with 
latency-hiding. 

Finally, latency-hiding introduces a drastically improved per- 
formance to the two communication intensive applications 
Jacobi and Jacobi Stencil. The waiting time percent- 
age going from 54% to 2% and from 62% to 9%, respec- 
tively. Latency-hiding also has a major impact on the over- 
all speedup, which goes from 5.9 to 12.8 and from 7.7 to 
18.4, respectively. In other words latency-hiding more than 
doubles the overall speedup and CPU utilization. 

6.1.2 Scaling above sixteen CPU-cores 
Overall, the performance is worsening at 32 CPU-cores - 
particular at 64 CPU-cores and above where the CPU uti- 
lization is below 50%. One reason for this performance 
degradation is the characteristic of strong scaling. In order 
to have considerable more data blocks than MPI-processes, 
the size of the data distribution blocks decreases as the num- 
ber of executing CPU-cores increases. Smaller data blocks 
reduces the performance since the overhead in DistNumPy 
is proportional with the size of a data block. 

However, smaller data blocks are not enough to justify the 
observed performance degradation. The von Neumann bot- 
tleneck 27 associated with main memory also hinder scal- 
ability. This is evident when looking at Figure 1191 which 
is a speedup graph of N-body that compares by node and 



by core scaling. At eight CPU-cores, both result uses iden- 
tical data distribution and block size, but the performance 
when only using one CPU-core per node is clearly superior 
to using all eight CPU-cores on one node. 

A NumPy application will often use ufuncs heavily, which 
makes the application vulnerable to the von Neumann bot- 
tleneck. The problem is that multiple ufunc operations are 
not pipelined in order to utilize cache locality. Instead, 
NumPy will compute a single ufunc operation at a time. 
This problem is also evident in DistNumPy since our latency- 
hiding model only address communication latency and not 
memory latency. 

7. FUTURE WORK 

The latency-hiding model introduced in this paper focuses 
on communication latency. However, the result from out 
benchmarks shows that memory latency is another aspect 
that is important for good scalability - particular when uti- 
lizing shared memory nodes. 

One approach to address this issue is to extend our latency- 
hiding model with cache locality optimization. The sched- 
uler will have to prioritize computation operations that are 
likely to be in the cache. One heuristic to accomplish this is 
to sort the operations in the ready queue after the last time 
the associated data block has been accessed. 

Another approach is to merge calls to ufuncs, that oper- 
ate on common arrays, together in one joint operation and 
thereby make the joint operation more CPU-intensive. If it 
is possible to merge enough ufuncs together the application 
may become CPU bound rather than memory bound. 

Introducing Hybrid Programming could also be a solution 
to the problem. In order to utilize hybrid architectures, |28| 
shows that shared and distributed memory programming 
can improve the overall performance and scalability. 

8. CONCLUSIONS 

While automatic parallelization for distributed memory ar- 
chitectures cannot hope to compete with a manually paral- 
lelized version, the productivity that comes with automatic 
parallelization still makes the technique of interest to a user 
who only runs a code a few times between changes. For 
applications that are embarrassingly parallel or applications 
where the computational complexity is 0(n 2 ) or higher, it 
is fairly straight forward to manage the communication for 
automatic parallelization. However, for common kernels the 
complexity is 0(nlog(n)) or even 0(n) and here the appli- 
cation of latency-hiding techniques is essential for perfor- 
mance. 

In this work we have presented a scheme for managing la- 
tency-hiding, that is based on the assumption that splitting 
the work in more blocks than there are processors will allow 
us to aggressively communicate data-blocks between nodes, 
while at the same time processing operations that require no 
external data-blocks. The same dependency analysis may be 
done without a division into data-blocks, but the blocking 
approach allows us to maintenance of a full DAG, an oper- 
ation that is known to be costly, and replace the DAG with 



a number of ordered linked lists, to which access is done in 
linear time. 

We implement the model in Distributed Numerical Python, 
DistNumPy, a programming framework that allows linear 
algebra operations expressed in NumPy to be executed on 
distributed memory platforms and this is without any effort 
towards parallelization from the programmer. 

A selection of eight benchmarks show that the system, as 
predicted, does not improve the performance of embarrass- 
ingly parallel applications or applications with complexity 
0(n 2 ) or higher. For applications with lower complexity 
the benefit from automatic latency-hiding is highly depen- 
dent on the relationship between the amount of data that 
needs to be transferred and the cost of updating the indi- 
vidual elements. In the Lattice Boltzmann codes, both 2D 
and 3D, the version without latency-hiding does quite well, 
simply because the operation of updating a data-point is 
time consuming enough to amortize the communication la- 
tency. However, when the cost of communication becomes 
more significant, as in a Jacobi solver and stencil-based Ja- 
cobi solver, the automatic latency-hiding significantly im- 
prove the performance. The performance of the stencil- 
based Jacobi-solver improves from a speedup of 7.7 to 18.4 
at sixteen processors and 8.6 to 25.0 at 128 processors, com- 
pared to standard sequential NumPy. This is matched by 
the fact that the time spend on waiting for communication 
drops from 62% to 9% and 87% to 41%, respectively, with 
the introduction of latency-hiding. 

Overall, the conclusion is that managing latency-hiding at 
runtime is fully feasible and makes automatic parallelization 
feasible for a number of applications where manual paral- 
lelization would otherwise be required. The most obvious 
target is the large base of stencil-based algorithms. 
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